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Summary. In many application areas, data are collected on a categorical response and high-dimensional 
categorical predictors, with the goals being to build a parsimonious model for classification while doing 
inferences on the important predictors. In settings such as genomics, there can be complex interac- 
tions among the predictors. By using a carefully-structured Tucker factorization, we define a model 
that can characterize any conditional probability, while facilitating variable selection and modeling of 
higher-order interactions. Following a Bayesian approach, we propose a IVIarkov chain IVIonte Carlo 
algorithm for posterior computation accommodating uncertainty in the predictors to be included. Under 
near sparsity assumptions, the posterior distribution for the conditional probability is shown to achieve 
close to the parametric rate of contraction even in ultra high-dimensional settings. The methods are 
illustrated using simulation examples and biomedical applications. 
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1. Introduction 

Classification problems involving high-dimensional categorical predictors have become common in 
a variety of application areas, with the goals being not only to build an accurate classifier but 
also to identify a sparse subset of important predictors. For example, genetic epidemiology studies 
commonly focus on relating a categorical disease phenotype to single nucleotide polymorphisms 
encoding whether an individual has 0, 1 or 2 copies of the minor allele at a large number of loci 
across the genome. In such applications, it is expected that interactions play an important role, 
but there is a lack of statistical methods for identifying important predictors that may act through 
both main effects and interactions from a high-dimensional set of candidates. Our goal is to develop 
nonparametric Bayesian methods for addressing this gap. 

There is a rich literature on methods for prediction and variable selection from high or ultra 
high-dimensional predictors with a categorical response. The most common strategy would rely 
on logistic regression with the linear predictor having the form a:^/?, with Xi = {xn, . . . ,Xip)' de- 
noting the predictors and = . . . , regression coefficients. In high-dimensional cases in 
which p is the same order of n or even p > n, classical methods such as maximum likelihood 
break down but there is a rich variety of alternatives ranging from penalized re gression to Bayesian 
variable selection. Pop ular methods include Li penalization ( Tibshirani. 19961) and the elastic net 



(|Zou and Hastie. 20051) . which combines Li and L2 penalties to accommodate p ^ n cases and 



allow simultaneous selection of correlated sets of predictors. For efficient Li regularization in gen 



eralized linear models including logistic regression. iPark and Hastie (20071 ) proposed a solution path 
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method. iGenkin et al. (20071 ) propose a related Bayesian approach for high-dimensional logistic re- 



gression under Laplace priors. IWu et al. (20091) applied Li penalized logistic regression to genome 



wide association stud ies. Potentially, re lated methods can be applied to identify main effects and 
epistatic interactions ( Yang et al.. 2010( ). but direct inclusion of interactions within a logistic model 



creates a daunting dimensionality problem limiting attention to low-order interactions and modest 
numbers of predictors. 

These limitations have motiva ted a rich variety of n onparametric classifiers, inc luding classifica - 
tion and regression trees (CART) ( Breiman et al.. 19841 ) and random forests (RFs) ( Breiman. 200 ll ). 



CART partitions the predictor space so that samples within the same partition set have relatively 
homogeneous outcomes. CART can capture complex interactions and has easy interpretation, but 
tends to be unstable computationally and lead to low classification accuracy. RFs extend CART 
by creating a classifier consisting of a collection of trees that are all used to vote for classification. 
RFs can substantially reduce variance compared to a single tree and result in high classification 
accuracy, but provides an uninterpretable black box that does not yield insight into the relationship 
between specific predictors and the outcome. Moreover, through our simulation results in section 
6, we found that random forests did not behave well in high dimensional low signal-to-noise cases. 

Our focus is on developing a new framework for nonparametric Bayes classification through 
tensor factorizations of the conditional probability P{Y = y\Xi = xi, . . . ,Xp = Xp), with Y G 
{l,...,do} a categorical response and X — {Xi, . . . ^Xp)' a vector oi p categorical predictors. 
The conditional probability can be expressed as a di x • • • x dp tensor for each class label y, 
with dj denoting the number of levels of the jth categorical predictor Xj. li p ~ 2 we could 
use a low rank matrix factorization of the conditional probability, while in the general p case 
we could consider a low rank tensor factorization. Such factorizations must be non-negative and 
constrained so that the conditional probabilities add to one for each p ossible X, and are fully fle xible 
in characterizing the classification function for sufficiently high rank. iDunson and Xing and 



Bhattacharva and Dunson (;2012h applied two different tensor decomposition methods to model the 



joint probability distribution for multivariate categorical data. Although an estimate of the joint 
pmf can be used to induce an estimate of the conditional probability, there are clear advantages 
to bypassing the need to estimate the high-dimensional nuisance parameter corresponding to the 
marginal distribution of X. 

We address such issues using a Bayesian approach that places a prior over the parameters in the 
factorization, and provide strong theoretical support for the approach while developing a tractable 
algorithm for posterior computation. Some advantages of our approach include (i) fully flexible 
modeling of the conditional probability allowing any possible interactions while favoring a parsimo- 
nious characterization; (ii) variable selection; (iii) a full probabilistic characterization of uncertainty 
providing measures of uncertainty in variable selection and predictions; and (iv) strong theoretical 
support in terms of rates at which the full posterior distribution for the conditional probability 
contracts around the truth. Notably, we are able to obtain near a parametric rate even in ultra 
high-dimensional settings in which the number of candidate predictors increases exponentially with 
sample size. Such a result differs from frequentist convergence rates in characterizing concentration 
of the entire posterior distribution instead of simply a point estimate. Similar contraction rate 
results in p diverging with n sett ings are currently only available in simple parametric models, such 



as the norm al means problem ([Castillo and Van Per Vaart. 20121 ) and generalized linear models 



(| Jiang. 20061 ). Although our computational algorithms do not yet scale to massive dimensions, we 



can accommodate 1, 000s of predictors. 
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2. Conditional Tensor Factorizations 

2. 1 . Tensor factorization of the conditional probability 

Although there is a rich literature on tensor decompositions, little is in statistics. The focus 
has been on two factorizations that generalize matrix singular value decompositio n (SVD). The 



most popular is parallel f actor analysis (PARAFAC) (H arshman, 1970: Harshm an and Lundv. 1994 : 



Zhang and Golub. 20011 ) . which expresses a tensor as a sum of r rank one tensors, with the mini- 
mal possible r defined as the rank. The second approach is Tuc ker decomp osition or higher-order 
singular value decomposition (HOS VD), which was proposed b v lTucker (196 61 for three-way data 
and extended to arbitrary orders by iDe Lathauwer et al. 1^20001) . HOSVD expresses a di x • • ■ x dp 
tensor A = {acj-.-cp} as 

di dj p 

hi=i hp=i j=i 

where G — {ghi - hj,} is a core tensor, with constraints on G such as low rank and sparsity imposed 
to induce better data compression and fewer components compared to PARAFAC. For probability 
tensors, we need nonnegative versions o f such decompositions and the concept of rank changes 



accordingly (jCohen and Rothblum. 19931) 



X 



The conditional probability P{Y — y\Xi = xi, . . . , Xp = Xp) can be structured as a do xdi x 
dp dimensional tensor. We will call such tensors conditional probability tensors. Let Vdi,....dp{do) 
denote the set of all conditional probability tensors, so that P G 'Pdi,...,dp{do) implies 

do 

P{y\xi, . . . ,Xp) > Oyy,xi, . . . ,Xp, P{y\xi, . . . , Xp) = 1 "ixi,...,Xp. 

y=i 

To ensure that P is a valid conditional probability, the elements of the tensor must be non- negative 
with constraints on the first dimension for Y. A primary goal is accommodating high-dimensional 
covariates, with the overwhelming majority of cells in the table corresponding to unique combina- 
tions of Y and X unoccupied. In such settings, it is necessary to encourage borrowing information 
across cells while favoring sparsity. 

Our proposed model for the conditional probability has the form: 

k\ hp p 

P{y\xi,...,Xp) = ^ ---^ Xh^h2...hpiy)Y[^rl^^{xJ), (1) 

hi — l hp — 1 J — 1 

with the parameters subject to 

do 

Xhih2...hp{c) ~ 1, for any possible combination of {iii, ii2, ■ ■ ■ , hp), 

c=l 

T^''^\xj) ~ I7 for any possible pair of (j, Xj). (2) 

h=l 



The fcj value controls the number of parameters used to characterize the impact of the jth predictor. 
In the special case in which kj = 1, the jth predictor is excluded from the model, so sparsity 
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can be imposed by setting kj — 1 for most j's. The representation ([T]) is many-to-one and the 
different parameters in the factorization cannot be uniquely identified. This does not present a 
barrier to our Bayesian approach and indeed over-parameterized models often have computational 
advantages in leading to simplified posterior computation and reduced autocorrelation in Markov 
chain Monte Carlo (MCMC) samples of para r neters of interest, such as the induced predictive 
distribution ( Bhattacharva and Dunson ('201l[ ). iGhosh and Dunson f2009( )). 

We format the conditional probability P{y\xi, . . . , Xp) as a di x • • • x dp vector 

Vec{P{y\-)} = {P{y\l, . . . , 1, 1), ...,1,2),..., P(y|l, . . . , 1, dp), . . . , 

P{y\l, . . . , dp_i, dp), . . . , P(y|di, . . . , dp_i, dp)}' 

and \hi,....hpiy) as a fci X • • • X fcp vector 

Vec{A{y)} = { Ai,...,i^i(y), Ai,... ,1^2(2/), ■ • ■ , 

Ai,...,i,fcp(2/), . . • , Ai^...,fcp_i,fcp(?/), . . . , Xki....,kpiy)} ■ 

Let TT*^^^ be a dj x kj matrix with Tri''\u) as the (m, u)th element. It is a stochastic matrix, so rows 
sum to one, by constraint ([2]). Then representation ^ can be written in vector form: 

Vec{Piy\-)} = (tt^i) (g) tt^^) ® • • • « T:'^P^)Vec{A{y)}, for y = 1, . . . , do, (3) 

where denotes the Kronecker product. Furthermore, if we let Mat{P) and Mat{K) be two 
stochastic matrices with the yth column Vec{P{y\~)} and V^ec{A(y)} respectively for y = 1, . . . , do, 
then we can write the above do identities together as: 

Mat{P) = (tt^^) ® 7r(2) ®---® tt^p^) M at{A) . 

The following theorem provides basic support for factorization ([T])-(I2]) through showing that any 
conditional probability has this representation. The proof of this theorem, which can be found 
in the appendix, sheds some light on the meaning of fci , . . . , fcp and how it is related to a sparse 
structure of the tensor. 



Theorem 1. Every do x di x d2 x • • ■ x dp conditional probability tensor P S Vdi....,dpido) can 
be decomposed as (QP, with 1 < kj < dj for j = Furthermore, )^hih2...hpiy) and irj^^Xj) 

can be chosen to be nonnegative and satisfy the constraints (0). 

We can simplify the representation through introducing p latent class indicators zi , . . . , Zp for 
Xi, . . . , Xp, with Y conditionally independent of (Xi, . . . , Xp) given (zi, . . . , Zp). The model can 
be written as 

Yi\zii,...,Zip Muhinomial({l,...,do},A2;i^...,z,p), 

z,,\X, ~ Multinomial({l,...,fc,},^P^(X,),...,7rif(X,)), (4) 

where Xzii,...,zip = {^zii....,zip{^), ■ ■ ■ , ^zii,....zip{do)}. Marginalizing out the latent class indicators, 
the conditional probability of Y given Xi, . . . , Xp has the form in ([T]) . 
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2.2. Prior specification 

To complete a Bayesian specification of our model, we choose independent Dirichlet priors for the 
parameters A = {Xh^^,,,ji^,hj = l,...,fcj,j = l,...,p} and tt — {t^^^^ {xj),hj — 1, . . . ,kj,Xj = 
l,...,dj,j = l,...,p}, 

{Ahi,...,/ip(l),...,A/ii,...^ftp(do)} Diri(l/do,...,l/do), 

{n^^\xj),...,7:i'^{xj)} ^ Diri(l/fc,,...,l/fc,),j = l,...,p. (5) 

These priors have the advantages of imposing non-negative and sum to one constraints, while leading 
to conditional conjugacy in posterior computation. The hyperparameters in the Dirichlet priors are 
chosen to favor placing most of the probability on a few elements, inducing near sparsity in these 
vectors. 

If kj = 1 in ([T]), by constraints ([2]) Tr[^\xj) = 1, P{y\xi, . . . ,Xp) will not depend on Xj and 
Y _L Xj\Xj',j' ^ j. Hence, I{kj > 1) are variable selection indicators. In addition, kj can be 
interpreted as the number of latent classes for the j th covariate. Levels of Xj are clustered according 
to their relations with the response variable in a soft probabilistic manner, with fci, . . . , fcp controlling 
the complexity of the latent structure as well as sparsity. 

To embody our prior belief that only a small number of fcj's are greater than one, we let 

P{k, = 1) = 1 - ^, P{k, =k) = JJ-^Tr^^ for fc = 2, . . . , d,J = 1, . . . ,p, 

where r is the expected number of predictors included. To further impose sparsity, we include a 
restriction that : kj > 1} < r, where f is a prespecified maximum number of predictors. We can 
choose the upper bound to correspond to twice the number of predictors we expect to be important, 
though in practice results tend to be robust to these hyperparameters unless f is chosen to be too 
small. Default values of r and f based on theoretical considerations are suggested in section 3.3. 
The effective prior on the kj 's is 

P(fci ^li,...,kp ^Ip) = P(ki = li) ■ --Pikp = Zp)/{|j{j,i^.>i}<p}(Zi, . . .,lp), (6) 

where Ia{-) is the indicator function for set A. Let 7 — (71, . . . ,7p)' be a vector having elements 
7j = I{kj > 1) indicating i nclusion of th e jth predictor. Under prior (|6]) the induced prior for 7 
is equivalent to the prior in I Jiang (20061 ). Potentially, we can put a more structured prior on the 



components in the conditional tensor factorization, including sparsity in A. However, the theory 
shown in the next section provides strong support for prior (IS])-®. 



3. Properties 

3. 1. Bias-variance trade off 

Because we are faced with extreme data sparsity in which the vast majority of combinations of 
Y, Xi, . . . , Xp are not observed, it is critical to impose sparsity assumptions. Even if such assump- 
tions do not hold, they have the effect of massively reducing the variance, making the problem 
tractable. A sparse model that discards predictors having less impact and parameters having small 
values may still explain most of the variation in the data, resulting in a useful classifier that has 
good performance in terms of the bias-variance tradeoff even when sparsity assumptions are not 
satisfied. We provide a simple illustrative example to demonstrate the tendency of our model to 
produce low MSE. 
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Suppose we have a binary response Y and p binary covariates Xj £ { — 1, 1}, j — 1, . . . The 
true model can be expressed in the form 

P{Y = l\X,=x,,...,Xp^Xp)^^ + ^x^ + --- + ^Xp, /3g(0,1). (7) 

The effect of Xj on the response Y decreases exponentiahy as j increases from 1 to p. A natural 
strategy is to estimate P{Y ~ l\Xi = xi,. . . ,Xp — Xp) by the sample frequencies over the first 
k covariates P{Y = l\Xi = xi, . . . , Xk = Xk) = tt{i : j/i = 1, xu = xi, . . . ,Xki = Xk]/'i,{i : xu = 
cci, . . . , Xki — Xk} and ignore the remaining p — k covariates. Suppose we have n = 2' {k < I ^ p) 
observations with one in each cell of combinations oi Xi, . . . , Xi. Under ((T]) , we can calculate the 
MSE of such an estimate as a function of k, and find a minimal MSE k value. 

MSE = E{P{Y ^l\Xi^ hi,..., Xp^ hp)- 

hi , . ...hp 

PiY = l\Xi^hi,...,Xk = hk)Y 
= Bias^ + Var, 
Bias^ = {P{Y ^l\Xi^ hi,..., Xp = hp)- 

hi,.... hp 

EP{Y = l\Xi=hi,...,Xk^hk)y 

2^p — k — 1 2 

Var = J2 VarP(r = l|Xi = /ii,...,Xfe = /ifc) 

hi,.... hp 

3 

Since there are 2^ cells, the average MSE for each cell equals 

i{(3 - /32)2*=-'-2 ^ ^22-k-l-2 ^ ^22-2fc-2 _ ^22-2p-2|^ 

3 

From this we can see that p, the number of covariates, has little impact on the selection of k. Recall 
that k < I and so the second term will be small comparing to the first and third terms. Hence, the 
average MSE obtains its minimum at fc w ^/3 = log2(n)/3. Even though the true model Q is not 
sparse and all the predictors impact the conditional probability, the optimal number of predictors 
only depends on the log sample size, with the number of predictors playing almost no role. This 
example also gives some intuition on the assumption of r„ ~ log(n) on the true model used in 
section 3.3. 

3.2. Borrowing of information 

A critical feature of our model is borrowing of information across cells corresponding to each 
combination of Xi, . . . , Xp. Letting Whi,....hp{xi, . . . ,Xp) = Yij ''^h^ i^j)^ model ^ is equivalent 
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to 

P{Y = y\Xi=xi,...,Xp = Xp)^ ^ Wh,^...^hp{xi,...,Xp)Xh,...hp{y), 

hi,...,hp 

and constraints ([2]) imply J2hi h Whi,...,h.p{xi, . . . , Xp) = 1. If \hi...hp(]j) is viewed as the frequency 
of F = y for the observations in cell Xi = hi,. . . ,Xp = hp, then our model essentially uses a 
kernel estimate that allows borrowing of information across cells via a weighted average of the cell 
frequencies. 

To illustrate the strength of this, consider a simple example involving one covariate X with 
TO categories and a binary response. Let Pj — P(Y — l\X — j) for j = l,...,m. A naive 
estimate for (Pi, . . . , Pm) is sample frequencies (fci/ni, . . . , km/rim), denoted by (Pi, . . . , Pm), where 
kj = ftli : Ui — 1 and Xi — j} and nj = |J{i : Xi = j}. Instead, we consider kernel estimates indexed 

by < c < 1/(to - 1) 

P,- = {1 - (to - l)c}P, + cJ^Pk, J = 1, . . . ,TO. 
We use squared loss to compare these two estimators. After some calculations, 

E{L{p, p)} = y: e{p, p,f = y: 



and E{L{P, P)} — X^Jli P'iPj ^ Pj)^ ^ function of c obtaining minimum 
E{L{P,P)] 



E{L{P,P)] 



m I E{L{P, P)} + ^i E.<,iP^ - P,? 
e[-E{L{P,P)],E{L{P,P)]], 

TO / 



at 

^ ^ 1 E{L{P,P)] ^ ^ 

' mE{L{P,P)} + ^^Y.^<,iP^-P,Y 



m — 1 



This suggests that when Pj's are similar, the estimate P can reduce the risk up to only 1/to the 
risk of estimating P separately. If Pj's are not similar, P can still reduce the risk considerably when 
the cell counts {nj} are small. 



3.3. Posterior convergence rates 

Suppose we obtain data for n observations y" = (yi, . . . ,yn)' , which are conditionally independent 
given X" = (xi, . . . , Xn)' with Xi = {xn, . . . , Xip^)' , Xij e {1,. . . ,d] and p„ ^ n. We exclude the 
n subscript on p when convenient and assume dj = d for j — 1, . . . ,p for simplicity in exposition, 
though the results generalize directly. Let Pq denote the true data generating model, which can be 
dependent on n. Rathe r than assum e that most of the predictors have no impact on Y, we consider 
the situation similar to iJiang C2006I) that most have nonzero but very small influence. Specifically, 
parameterizing the true model Pq in our tensor form with kj — d for j = 1, . . . ,p„ (this is always 
possible for any Pq), we assume: 
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Assumption A. J2^=i T^^'^xj J2h =2 (^i) ^ 

This is a near sparsity restriction on Pp- We additionally assume that the true conditional proba- 
bilities are strictly greater than zero, 

Assumption B. PQ{y\x) > Eq for any x,y for some eg > 0. 

The next theorem states the posterior contraction rate under our prior (O-®. We use f ^ g to 
mean that / is less than 5 up to a constant independent of n. Recall that r„,r„ are hyperparam- 
eters in the prior corresponding to the expected and maximum number of important predictors, 
respectively. 

Theorem 2. Assume the design points Xi,...,Xn are independent observations from an un- 
known probability distribution G„ on {!,..., d}^". Moreover, assume the prior is specified as 
in and Assumptions A and B hold. Let e„ be a sequence with e„ — > 0, ne^ — 00 and 

^^exp(— ne^) < 00. Assume the following conditions hold: (i) f n log Pn -< f^e^, (ii) fnd^" log{fn/en) -< 
ne^, (Hi) rn/pn — as n — >■ 00, and (iv) there exists a sequence of models 7„ with size fn such that 

Ej^7„ max^,^. J2t^=2^h-i^3) -< 4- Denote d{P, Pa) = J J^tU |-P(y|a;i,. ■ . ,Xp)-Poiy\xi, . . . , Xp)\Gnidxi, . . .,dxp), 
then 

n„{P : d{P,Po) > Me„|2/",X"} a.s.P^. 

The following corollary tells us that the posterior convergence rate of our model can be very 
close to n~^/^ for appropriate hyperparameter choices. 

Corollary 3. For any a e (0,1), e„ = n^'^"")/^ log n will satisfy the conditions in Theorem 
\^if Tn ^ fn -< logn, Pn ~< exp(n") and there exists a sequence of models 7„ with size fn such that 

Sj^7„ "^^^^^ T.%=2 4f (a^j) """^ fog^ n. 

The condition f„ -< logn is equivalent to n/d^"^ >- 1, which means that to obtain good ap- 
proximations to the true model, we cannot include too many predictors, and should ensure that on 
average there is order one observation in each cell. Based on the above observations, we recommend 
using r„ = log^(n),f„ = 2r„ as default values for the prior in applications. 

4. Posterior Computation 

In section 4.1, we consider fixed k — (fci,...,fcp)' and use a Gibbs sampler to draw posterior 
samples. Generalizing this Gi bbs sampler, we developed a reversible jump Markov Chain Monte 
Carlo (RJMCMC) algorithm (jCreen. 19951 ) to draw posterior samples from the joint distribution 
of fc = {kj : j 1, . . . ,p} and (A, tt, z). However, for n and p equal to several hundred or more, we 
were unable to design an RJMCMC algorithm that was sufficiently efficient to be used routinely. 
Hence, in section 4.2, we propose a faster two stage procedure based on approximated marginal 
likelihood. 

4. 1. Gibbs sampling for fixed k 

Under ([5]) the full conditional posterior distributions of A, tt and z all have simple forms, which we 
sample from as follows. 
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(a) For /ij = 1, . . . , /cj , j = 1, . . . ,p, update Xhi,...,hj, from the Dirichlet conditional, {Xhi,...,hpi'i^), • • • , Xhi,...,hpid)} \ 

Dirif - = /i^, . . . , = h^.y^ = 1), 

1 " A 

...,-+ ^ l(zii = /ii, . . . = /ip,j/j = d) j • 

(b) Update tt^^-' (/c) from the Dirichlet full conditional posterior distribution, 

{^(^)(fc),...,4f(fc)}|- ^ Diri - + = l)l(:r,, =fc), 

1 " \ 

..., — + V l(zy = fcj)l(a;jj fc) . 
''j' 2=1 

(c) Update from the multinomial full conditional posterior, with 
4.2. Two step approximation 

We propose a two stage algorithm, which identifies a good model in the first stage and then learns 
the posterior distribution for this model in a second stage via the Gibbs sampler of section 4.1. 
We first propose an approximation to the marginal likelihood. For simplicity in exposition, we 
focus on binary Y with do = 2, but the approach generalizes in a straightforward manner, with 
the beta functions in the below expression for the marginal likelihood replaced with functions of 
the form T{ai)T{a2) ■ ■ ■T{a(io)/T{ai + • • • + a^o). To motivate our approach, we first note that 
^h'^i^j) '^^^ be viewed as providing a type of soft clustering of the jth feature Xj, controlling 
borrowing of information among probabilities conditional on combinations of predictors. To obtain 
approximated marginal likelihoods to be used only in the initial model selection stage, we propose 
to force irl^^Xj) to be either zero or one, corresponding to a hard clustering of the predictors. 
Under this approximation, the marginal likelihood has a simple expression. 

For a given model indexed by A: = {kj,j = we assume that the levels of Xj are 

clustered into kj groups A'f^ , . . . , A^^ . For example, with levels {1, 2, 3, 4, 5}, A[^^ = {1,2,3} 

and A^"''' = {4,5}. Then it is easy to see that the marginal likelihood conditional on k and A is 
Ciy\k'A) = 

^ Beta(l)2,l/2) ^-^--(^ + g '^^^ ^ ' " ' ^ ^ ' = 



n s 

- + ^ I{x,r e <\ . . . , G <\ = 0) j . 



Having an expression for the marginal likelihood, we apply a stochastic search MCMC algorithm 
(jGeorge and McCuUoch. 19971 ) to obtain samples of [ki, . . . , kp) from the approximated posterior 



distribution. This proceeds as follows. 
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(a) For J = 1 to p, do the following. Given the current model indexed hy k — {kj : j = 1, . . . ,p} 

and clusters A = {A^jp : h = l,...,fcj,j — propose to increase kj to kj + 1 (if 

kj < d) or reduce it to kj — 1 (if kj > 1) with equal probability. 

(b) If increase, randomly split a cluster of Xj into two clusters (all splits have equal probability). 

For example, if dj =5, kj = 2 and the levels of Xj are clustered as {1,2,3} and {4, 5}. There 
are 4 possible splitting schemes: three ways to split {1, 2, 3} and one way to split {4, 5}. We 
randomly choose one. Accept this move with acceptance rate based on the approximated 

marginal likelihood. 

(c) If decrease, randomly merge two clusters and accept or reject this move. 

Estimating approximated marginal inclusion probabilities of kj > 1 based on this algorithm, we 
keep predictors having inclusion probabilities great than 0.5; this leads to selecting the median 
probability model, which i n simpler settings has been shown to have optimality properties in terms 
of predictive performance ( Barbieri and Berger. 20o"3 ). 



5. Simulation Studies 

To assess the performance of the proposed approach, we conducted a simulation study and calculated 
the misclassification rate on the testing samples. Simulated data consisted of = 2,000 instances 
with p = 600 covariates Xi, . . . , Ap, each of which has d = 4 levels, and a binary response Y. 
We assumed that the true model had three important predictors Xq,Xii and X13, and generated 
P{Y ~ l\Xg = xg, All = 2;ii, A'13 = zia) independently for each combination of (xg, xn, X13). To 
obtain an average Bayes error rate (optimal misclassification rate) around 15%, we generated the 
conditional probabilities from f{U) = U^/{U^ + (1 — U)^}, where U ~ Unif(0, 1). Each time, we 
randomly chose n samples as training with the remaining N — n as testing. We implemented our 
approach using the training set and calculated the test sample misclassification rate corresponding 
to the average MSE defined as 

aMSE-1 ^ {P(r = l|a;i,...,Xp)-F(y = l|a;i,...,Xp)}', 

Xi,...,Xp 

where P is the fitted conditional probability. We selected four training sizes n = 200, 400, 600 
and 800. For each training size, we randomly chose 10 training-test splits and used our two stage 
algorithm to fit the model for each split. According to our theoretical results, we chose r = log4(n) 
as the expected number of important predictors in the prior. We ran 1,000 iterations for the first 
stage and 2,000 iterations for the second stage, treating th e first half as b urn-in. In addition. 



we compared the results with the random forests algorithm (jBreiman. 20011 ) applied to the same 
training-test split data. 

Table 1 displays the results. In the very challenging case in which the training sample size was 
only 200, both methods had poor performance. However, as the training sample size increased, 
the proposed conditional tensor factorization method rapidly approached the optimal 15%, with 
excellent performance even in the n ^ p = 600 case. In contrast, random forests had consistently 
poor performance in this challenging setting involving a low signal strength, a modest sample 
size, and moderately large numbers of candidate predictors. In addition to the clearly superior 
classification performance, our method had the advantage of providing variable selection results. 
Table 2 provides the average approximated marginal inclusion probabilities for the three important 
predictors and remaining predictors for each training sample size. Consistently with the results in 
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Table 1. Testing Results for Synthetic Data Example. RF: random 
forests; TF: Our tensor factorization model. 



training size 


200 400 600 800 


aMSE 

Misclassification Rate of TF 
Misclassification Rate of RF 


0.144 0.042 0.024 0.010 
0.503 0.288 0.189 0.168 
0.496 0.482 0.471 0.472 



Table 2. Variable Selection Results for Synthetic Data Exam- 
ple. Columns 2-4 are approximated inclusion probabilities of 
the 9th, 11th, 13th predictors. Column 5 is the maximum inclu- 
sion probability across the remaining predictors. Column 6 is 
the average inclusion probability across the remaining predic- 
tors. These quantities are averages over 1 trials. 



training size 


9 


11 


13 


Max 


Average 


200 


0.092 


0.041 


0.063 


0.161 


0.002 


400 


0.816 


0.820 


0.808 


0.013 


0.000 


600 


1.000 


1.000 


1.000 


0.000 


0.000 


800 


1.000 


1.000 


1.000 


0.000 


0.000 



Table 1, the method fails to detect the important predictors when the training sample size is only 
n = 200 but as the sample size increases appropriately assigns high marginal inclusion probabilities 
to the important predictors and low ones to the unimportant predictors. This is just one initial 
set of simulations against one competitor, which is often thought to provide good performance in 
classification problems, but the results are promising. 



6. Applications 

We compare our method with other competing methods in three data sets from the UCI repository. 
The first data set is Promoter Gene Sequences (abbreviated as promoter data below). The data 
consists of A, C, G, T nucleotides at p = 57 positions for N = 106 sequences and a binary response 
indicating instances of promoters and non-promoters. We use 5-fold cross validation with n = 85 
training samples and N — n = 21 test samples in each training-test split. 

The second data set is the Splice-junction Gene Sequences (abbreviated as splice data below). 
These data consist of A, C, G, T nucleotides at p = 60 positions for = 3, 175 sequences. Each 
sequence belongs to one of the three classes: exon/intron boundary (EI), intron/exon boundary 
(IE) or neither (N). Since its sample size is much larger than the first data set, we compare our 
approach with competing methods in two scenarios: a small sample size and a moderate sample 
size. In the small sample size case, each time we randomly select n — 200 instances as training and 
calculate the misclassification rate on the testing set composed of the remaining 2, 975 instances. 
We repeat this for each method for five training-test splits and report the average misclassification 
rate. In the moderate sample size case, we use 5-fold cross validation so that each time n — 2, 540 
instances are treated as training data. 

The third data set describes diagnosing of cardiac Single Proton Emission Computed Tomogra- 
phy (SPECT) images. Each of the patients is classified into two categories: normal and abnormal. 
The database of 267 SPECT image sets (patients) has 22 binary feature patterns. This data set 
has been previously divided into a training set of size 80 and a testi ng set of size 187. 



As competitors we considered lasso penalized logistic regression (jPark and Hastie. 20071 ) and 5 
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Table 3. UCI Data Example. RF: random forests, NN: neural networks, SVM: support 
vector machine, BART: Bayesian additive regression trees, TF: Our tensor factoriza- 
tion model. IVIisclassification rates are displayed. 



Data 


CART 


RF 


NN 


LASSO 


SVM 


BART 


TF 


Promoter (n=85) 


0.236 


0.066 


0.170 


0.075 


0.151 


0.113 


0.066 


Splice (n=200) 


0.161 


0.122 


0.226 


0.141 


0.286 




0.112 


Splice (n=2540) 


0.059 


0.046 


0.165 


0.123 


0.059 




0.058 


SPECT (n=80) 


0.312 


0.235 


0.278 


0.277 


0.246 


0.225 


0.198 



black-box algorithms: CART, randoni forests (jBreiman. 20011 ). neural netw orks with two layers o f 
hidden units, support vector machines and Bayesian additive regression trees ( Chipman et al.. 2010( ) . 
Among them, BART was not implemented in the splice data since we were unable to find a multi- 
class implementation of their approach. 

Table 3 shows the results. Our method produced at worst comparable classification accuracy to 
the best of the competitors in each of the cases considered. Consistent with our previous experience 
in more compressive comparisons of classifiers, Random Forests (RF) provided the best competitor 
overall, justifying our focus on RF in the simulation examples above. We expect our approach to 
do particularly well when there is a modest training sample size and high-dimensional predictors. 
We additionally have an advantage in terms of interpretability over several of these approaches, 
including RF and BART, in conducting variable selection. For example, in the promoter data, our 
model selected nucleotides at 15th, 16th, 17th, and 39th positions as important predictors when 
the full data are used to fit the model. In the splice data, the 28th, 29th, 30th, 31st, 32nd and 
35th positions are selected. These results are reasonable since for nucleotide sequences, nearby 
nucleotides form a motif regulating important functions. In the SPECT data, the list, 13rd 
and 16th predictors are selected. It is notable that in each of these cases we obtained excellent 
classification performance based on a small subset of the predictors. 



7. Discussion 



This article proposes a framework for nonparametric Bayesian classification relying on a novel 
class of conditional tensor factorizations. The nonparametric Bayes framework is appealing in 
facilitating variable selection and uncertainty about the core tensor dimensions in the Tucker-type 
factorization, while avoiding the need for parameter tuning. In particular, we have recommended 
a single default prior setting that can be used in general applications without relying on cross- 
validation or other approaches for estimating tuning parameters. One of our major contributions 
is the strong theoretical support we provide for our proposed approach. Although it has been 
commonly observed that Bayesian parametric and nonparametric methods have practical gains in 
numerous applications, there is a clear lack of theory supporting these empirical gains. 

Interesting ongoing directions include developing faster approximation algorithms and gener- 
alizing the conditional tensor factorization model to acc ommodate broader fe ature modalities. In 
the fast algorithms direction, online variational methods ( Hofi^man et al.. 20ldh provide a promising 
direction. Regarding generalizations, we can potentially accommodate continuous predictors and 
more complex object predictors (text, images, curves, etc) through probabilistic clustering of the 
predictors in a first stage, with Xj then corresponding to the cluster index for feature j. 
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Appendix A: Proof of Theorem Q] 

Proof (Theorem [T]). First reshape P{y\xi, . . . , Xp) according to xi as a matrix A^^'^ of size 
di X dod2d^ ■ ■ ■ dp, with the h^^ row a long vector, 

{P{l\h, 1, . . . , 1, 1), P(l|/i, 1, . . . , 1, 2), . . . , P{l\h, 1, . . . , 1, dp), 

P{l\h,l,...,2,l),...,P{\\h,l,...,2,dj),...,P{do\h,d2,...,dp.udp)], 

denoted A^^^ {h, (y, X2, • • • , Xp)}. Let ki be the smallest number such that 

P{y\x,, ...,xp)^ (y, X2, . . . , Xp)} = ^ Altl...,(2/)4'^ («) 

h=l 

subject toX;^LiA|,^2^_^^(2/) = 1 for each (/i, X2, Xp), X^I^Li tt^^^ (a;i) = lforeacha;i, A,^^^j^_^^(y) > 

0, and tt\1\xi) > 0. In this non-negative matrix factorization, ki < di exists because Aj^^J ^ (y) — 

P{y\h ), ttI^^xi) = with kj = is a choice satisfying the above equations. Here 

Sij = 1 if i = J, (5ij = if i ^ J is Kronecker delta function. 

Taking Al^xa-.-ip (y) from ([5|) with argument X2, we can apply the same type of decomposition 
to obtain 

h=l 

subject to J2tLi >^x}h...xSy) = 1' ^^'^^ (xi, /i, . . . , Xp), X;^Li 4^^(a;2) = 1, for each X2, >^^x^h...xS^) > 
0, and 7r)j (0:2) > 0. Plugging back into equation ([8]), 

P(2/|xi,...,x,)= ^ E A,?1.3...,^(y)7r«(^i)<^(^2). 

/ll = l /l2 = l 

Repeating this procedure another {p — 2) times, we obtain equation ([T|) with Xhih2...hp{y) = 
^h!h2...hp(y) and constraints ([21). 



Appendix B: Proof of Theorem |2] 

To prove Theore m [2] we need some pr eliminaries. The following theorem is a minor modification 
of Theorem 2.1 in Ghosal et al. and the proof is included in a supplemental appendix. For 

simplicity in notation, we denote the observed data for subject i as Xi with Xi^<^ P ^ V , P ^ li, 
and the true model Pq. 

Theorem 4. Let e„ be a sequence with e„ — 0, ne^ — 00, X]n^^P(~^^n) ^ °o. i^ei d be the 
total variance distance, C > Q he a constant and sets Vn C V . Define the following conditions: 
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(a) logN{en,Vn,d) < nel; 

(h) T\^{r\Vn) < exp{-(2 + C)nel}; 

(c) n„(F: lllog-^lloo < 4) > exp(-Cne2). 

// the above conditions hold for all n large enough, then for M sufficiently large, 

n„{P : d{P, Po) > Me„|Xi, . . . , X„} ^ a.s.P^\ 

In our case, Xi include the response and predictors Xi, P is the random measure characterizing the 
unknown joint distribution of {yi, Xi) and Pq is the measure characterizing the true joint distribution. 
As our focus is on the conditional probability, P(y\x), we fix the marginal distribution of X at it's 
true value Pq{x) and model the unknown conditional P{y\x) independently of the marginal of X. 
By doing so, it is straightforward to show that we can ignore the marginal of X in using Theorem 
2 to study posterior convergence. We simply restrict P to the set of joint probabilities such that 
P{x) = Po{x). The total variation distance between the joint probabilities P and Pq is equivalent 
to the distance between the conditionals defined in Theorem 2 by the identity 



|P(y,xi, . . . ,Xp) - Po{y,xi,...,Xp)\dxi ■ ■ ■ dxp = 

/do 
\P{y\xi,...,Xp) - Po{y\xi,. . . ,Xp)\dGn{dxi, - ■ ■ ,dxp). 

Therefore, we will not distinguish the joint probability and the conditional probability and use P 
to denote both of them henceforth. 

To prove Theorem 2, we also need upper bounds on the distance between two models specified 
by ((ll when the models are the same size and when they are nested. 

Lemma 5. Let P and P be two models specified by (0) with parameter (A, tt) and (A, vr), respec- 
tively. Then 

do p 

d{P,P) < 51;,™'^^ \>^hih2...h^{y) ->^h^h2...h^{y)\ +c^o^max|7r,'/^.^(xj) -wl^^{xj)\. 

y = l '' j = l ^' ^ 

Proof (Lemma [5|) . By definition of d{P, P), we only need to prove that for any y — 1, . . . ,do 
and any combination oi (xi, . . . , Xp), 

\P{y\xi,...,Xp) - P{y\xi,...,Xp)\ < max \Xhih2...hp{y) - >^h,h2...hp{y)\ 

All , ... , il'p 



-^max|4f(x,)-4f(2;,)|. (9) 



Actually, 

p 

\P{y\xi,.-.,Xp) - P{y\xi,...,Xp)\ < A + '^Bs, 
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where 

fci kp p 

hi = l hp = l j=l 

ki kp p 

< ^max \Xh^h2...hpiy) ~'Xh^h2...hpiy)\ 51 XI H'^'^f^^j) 

^ hi = l hp = lj=l 

= max \Xh,h2...hpiy) - Xhih2...hp{y)\, 

fil ,...,/lp 

where the last step is by using the second equation in Q, and 

hi = l hp = l j=l j=s+l 

< max|4'j(a:j) 

where the last step is again by using the second equation in 1^ and the fact that \hih2...hp{y) < 1- 
Combining the above inequahties we can obtain 

Lemma 6. Let P and P be two models as in (0j with parameters (A, tt) and (A, tt), respectively. 
Suppose P is nested in P , i.e. there exists a number r s.t. 

Xht-h,.h,.+i-hp = Xht-h,.i-i, for h.j < d,j ^ I, . . . ,p, 

Then 

p d 

d{P,P)<do maxj] 7^i^.)(a;,). 

j=r+l ^ hj=2 

Proof (Lemma [B]). By the constraints on tt in 
\P{y\xi, ...,xp)- P{y\xi, . . .,Xp)\ 

d d p 



< E ■■'E,,™^^ \Xh^-h^i-i{y) -Xh,...hp{y)\ n 4v(^i) 

hr+i = l hp = l ' ' *" j=r+l 
d d p 

< E ■■'E;,™'^^ \Xhr-h^i-iiy) -Xh^...hpiy)\ n ^h-i^j) 

hr+i=2 hp = l ]=r+l 

d d p 

E ■■■E.'"^^ \Xhr-Ki-iiy) - Xh^...hpiy)\ TT 4f( 



hp— 2 J— r+1 

The lemma can be proved by noticing Xh^.-.h^iy) ^ [0, 1] 
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Proof (Theorem . We verify conditions (a)-(c) in Theorem|31 As we described previously, 
we do not need to distinguish the joint probabihty and the conditional probability under our prior 
specification. Let P„ be all conditional probability tensors having no more than f„ predictors, so 
that |7„| < 

Condition (a): By the conclusion of lemmaEl we know that an e„-net £"„ of Vn can be chosen so 
that for each (7, A, tt) S Vn that satisfies constraints ([2]), there exists (7, A, tt) S En such that 7 = 7, 

^oaal^y^h^,...,h^ \Xh^h2...h^{y) - >^h^h2...hAy)\ < (rJ+i)do ^^^^ max^^.,^^. kl^^ixj) - 4f (^i)l < (f„+i)do 
for J G 7. Hence, we can pick d-balls of the form 

hi,...,hp,y ^ \ / / j — lhj,Xj 

For each fixed size model with I7I < r„ in Vn, there are at most dprf'"" ^hih2...hp{y)'s and f„d^ 

Trj^^XjYs. For 7, there are at most models of size r. Hence, the log of the minimal number of 
size-e„ balls needed to cover Vn is at most 

log { (r-„ + 1)pI- } + fnd,d'-+' log irulD^ . 

By the conditions in the theorem, each term is bounded by some constant xne^, and we can adjust 
these constants to make it less than ne^. 

Condition (h): Because H„(7'^) = in our case, this condition is trivially satisfied. Actually, 
this condition will still be satisfied as long as H„(7 > f„) < exp{ — (2 + C)ne^}, which implies that 
the prior probability assigned to large models is exponentially small. 

Condition (c): As Pq is lower bounded away from zero by eq, || log -^||oo < is implied by ||P — 
Pol 1 00 < eofin for n large enough (e„ ^ as n increases). Let (A, tt) denote parameters for the true 
model Pq- Applying lemma[5]to bound d{P, P), where P(y|a;i, . . . , Xp) = Po(y|a;i, . . . , x^^, !,...,!), 
and then estimating the difference between P and Pq by lemma [51 we have 

d{P,Po)< max \Xhih2...h^„{y) ~ ^hih2...hf,A...i{y)\ 

hi .....hf^^ 

^ ' r... . (10) 



E max V^i')(x,) 



Combining (jlOp and condition (iv) in Theorem [51 and assuming without loss of generality that 7„ 



corresponds to the first r„ covariates, || log -^-Hoo < is implied by 



el 



max \Xhih2...hfSy) ~ ^hih2...hf^\...\{v)\ < - ,T. 

hx,...,hf^ r„ + 1 

2 

m.B.-^y^l\x,)-=k^}{xj)\ < 

Moreover, the Dir(l/d, . . . ,1/d) and Dir(l/do, • • • , 1/do) priors for \hih2...hr^ (') ^^id ■K^^\xj) have 
density lower bounded away from zero by a constant not involving n, and the prior P(7 = 7„) 
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is {rn/Pny^C^ - rn/Pny"'"''', SO as rn/Pn 0, logn„(7 = 7„) ~ f„log(r„/p„) > -f„logp„. 
Combining these and the conditions in the theorem, 

logn„ fp : II log ^lloo < e^) y r„dorf''" log ^" - f„ logp„ y -ne^. 

By adjusting constants, we can make logn„(P : || log ;^||oo < e^) > —Cne^. 

References 

Barbicri, M. M. and J. O. Berger (2004). Optimal predictive model selection. Ann. Statist. 32, 
870-897. 

Bhattacharya, A. and D. B. Dunson (2011). Sparse Bayesian infinite factor models. Biometrika 98, 
291 306. 

Bhattacharya, A. and D. B. Dunson (2012). Simplex factor models for multivariate unordered 
categorical data. J. Amer. Statist. Assoc. 107, 362-377. 

Breiman, L. (2001). Random forests. Mach. Learn. 45, 5-32. 

Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone (1984). Classification and Regression 
Trees. Belmont, CA: Wadsworth, Inc. 

Castillo, I. and A. Van Der Vaart (2012). Needles and straw in a haystack: Posterior concentration 
for possibly sparse sequences. Submitted to Ann. Statist.. 

Chipman, H. A., E. I. George, and M. R. E. (2010). BART: Bayesian additive regression trees. 
Ann. Appl. Stat. 4, 266-298. 

Cohen, J. E. and U. G. Rothblum (1993). Nonnegative ranks, decompositions, and factorizations 

of nonnegative matrices. Linear Algebra Appl. 190, 37. 

Do Lathauwcr, L., B. Do Moor, and .J. Vandcrwalle (2000). A multilinear singular value decompo- 
sition. SIAM J. Matrix Anal. Appl. 21, 1253-1278. 

Dunson, D. B. and C. Xing (2009). Nonparametric Bayes modeling of multivariate categorical data. 
J. Amer. Statist. Assoc. 104, 1042-1051. 

Genkin, A., D. D. Lewis, and D. Madigan (2007). Large-scale Bayesian logistic regression for text 
categorization. Technometrics 49, 291-304. 

George, E. and R. McCuUoch (1997). Approaches for Bayesian variable selection. Statist. Sinica 7, 
339-373. 

Ghosal, H., J. K. Ghosh, and A. W. Van Der Vaart (2000). Convergence rates of posterior distri- 
butions. Ann. Statist. 28, 500-531. 

Ghosh, J. and D. B. Dunson (2009). Default prior distributions and efficient posterior computation 
in Bayesian factor analysis. J. Comput. Graph. Statist. 18, 306-320. 



1 8 David B. Dunson 

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model 
determination. Biometrika 82, 711-732. 

Harshman, R. (1970). Foundations of the PARAFAC procedure: Models and conditions for an 
'exploratory' multi-modal factor analysis. UCLA working papers in phonetics 16, 1-84. 

Harshman, R. and M. Lundy (1994). Parallel factor analysis. Comput. Statist. Data Anal. 18, 
39-72. 

Hoffman, M., D. Blei, and F. Bach (2010). Online learning for latent Dirichlet allocation. Neural 
Information Processing Systems. 

Jiang, W. (2006). Bayesian variable selection for high dimensional generalized linear models. Ann. 
Statist. 35, 1487-1511. 

Park, M. Y. and T. Ha.stic (2007). L-1 rcgularization path algorithm for generalized linear models. 
J. R. Stat. Soc. Ser. B Stat. Methodol. 69, 659-677. 

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. 

Methodol. 73, 273-282. 

Tucker, L. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika 31, 
279-311. 

Wu, T. T., Y. F. Chen, and T. Hastie (2009). Genome- wide association analysis by lasso penalized 
logistic regression. Bioinformatics 25, 714-721. 

Yang, C, X. Wan, and Q. Yang (2010). Identifying main effects and epistatic interactions from 
large-scale SNP data via adaptive group lasso. BMC Bioinformatics 11. 

Zhang, T. and G. H. Golub (2001). Rank-one approximation to high order tensors. SIAM J. Matrix 
Anal. Appl. 23, 534. 

Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. J. R. Stat. 
Soc. Ser. B Stat. Methodol. 67, 301-320. 



